ABSTRACT


Automatic assessment of the quality of classroom discourse can 
have a transformative effect on research and practice on improving 
teaching effectiveness. We improve on a previous automated 
method to measure teacher authentic questions — open-ended 
questions without pre-scripted responses that predict student 
achievement growth — using classroom audio and expert question 
codes from two sources: (1) a large archival database of text 
transcripts of 428 class-sessions from 116 classrooms, and (2) a 
newly collected sample of 132 high-quality audio recordings with 
automatic speech recognition transcripts from 27 classrooms. 
Whereas previous work utilized a “closed vocabulary” approach, 
consisting of 732 pre-defined word, sentence, and discourse level 
features, the present “open vocabulary” approach exclusively 
utilized word and phrase counts from the transcripts themselves. 
The two approaches yielded substantial, but statistically equivalent, 
correlations with gold-standard human codes of authenticity 
(Pearson r’s of 0.396 vs. 0.424 and 0.602 vs. 0.613 for datasets 1 
and 2, respectively). Importantly, averaging estimates from the two 
approaches resulted in statistically significant improvements over 
either approach (r’s of 0.492 and 0.686 for datasets 1 and 2, 
respectively). We discuss implications of our findings for 
automated analysis of classroom discourse. 
1. INTRODUCTION
(Example 1) 


Teacher: “How does a person become a noble?” 
Student: “They’re born into it.”


Teacher: “They’re born into it, right? It’s by family. It gets passed 
down so if you're a noble, your child would be a noble, their child 
would be...it’s a tradition, right?” 
(Example 2)


Teacher: “How did that make you guys feel, I mean what was your 
gut reaction to all that?” 


Student: “Ashamed.” 


Teacher: “Ashamed in what way?” 


Consider these discourse exchanges between a teacher and his/her 
students from an actual classroom. The first follows the oft-used, 
but ineffective, Initiate-Response-Evaluate (IRE) [40] mode of 
questioning. Now contrast this with the second case, where the 
teacher asks an open-ended question or a question without a pre- 
scripted response. Although it only elicited a one-word answer 
from the student, the teacher withheld evaluation, instead building 
on the student’s response, thereby “opening up” the conversation. 


Such questions — called authentic questions — whose answers are 
not presupposed by the teacher (e.g. “Do you think Abigail is going 
to tell the truth?” [33]) are a core dimension of dialogic instruction 
related to student engagement and achievement growth [24, 25, 42], 
and are central to many conceptual models of effective discourse 
practices [39, 50, 63]. Prior research utilized expert human coders 
to identify discourse practices at the level of individual questions 
and thus provided exceptionally precise measures of instructional 
practice. Our goal is to precisely estimate the prevalence rate of 
teacher authentic questions using fully-automated methods. 


Why bother in the first place? Teacher observation has
become increasingly central to educational research and school 
improvement efforts [2, 26, 28, 35, 58]. Observations of classroom 
practice are valuable because they identify specific domains of 
practice for improvement [36] and can target dimensions of 
schooling not captured by test scores, such as socialization 
processes in elementary school [32]. Classroom observations also 
enhance school principals’ role in managing teachers’ work [30]. 
Yet current in-person observational methods are logistically 
complex, require observer training, are an expensive allocation of 
administrators’ time [4], and simply do not scale. 


Can computers help? We think so, and report the results of ongoing 
research efforts to automate the analysis of teacher question-asking 
behavior, a common component across various well-known 
observation protocols (e.g., Domain 3 of Danielson’s Framework 
for Teaching [16]; PLATO’s Classroom Discourse Element [27]). 
Our specific emphasis on authentic questions is motivated by the 
strong research base linking them to engagement and achievement
as cited above. 


1.1 Related Work 


There has been considerable work on detecting questions from text 
[1], with fewer studies focusing on audio [8, 45, 61]. These studies 
also largely focus on general question detection from meetings and 
other interactions, which is quite different from the present goal of 
detecting authentic questions from real-world classrooms. 
Blanchard et al. [6] and Donnelly et al. [20] investigated question 
detection from classroom audio, but again, their emphasis was on 
discriminating questions from other utterances, which is a related 
but distinct problem from authenticity detection. There has also 
been research on automated analysis of teacher and student 
discourse [18, 19, 62], but these studies emphasize modeling of 
general instructional activities (e.g., distinguishing between lecture 
vs. group work vs. discussion) rather than authentic questions. 


To our knowledge, there have only been three studies germane to 
our goal of detecting authentic questions from classroom discourse. 
Samei et al. [53] focused on identifying authenticity from human- 
transcribed questions from the Partnership for Literacy Study, a 
large sample of over 20,000 questions and associated “gold- 
standard” human codes (see section 2.1). The authors repurposed 
features (e.g., part of speech tags) from an existing speech act 
classifier [44] to train a J48 classifier to detect authenticity of 
individual questions. They achieved a Cohen’s kappa of 0.34 and 
accuracy of 67%, which they deemed promising but in need of 
improvement. 


In a follow-up study, Samei et al. [54] focused on testing the 
generalizability of this model. They split the data based on whether 
it was collected in an urban or non-urban area and whether the 
teacher had been trained in dialogic practices (including the use of 
authentic questions and other effective teacher talk strategies). 
They found that classifiers trained on a subset (e.g. urban) and 
tested on the dual subset (e.g. non-urban) were fairly close in 
accuracy to one another, but that some subpopulations were more 
representative of the data than others, making them better for 
classifier training. 


Of utmost relevance to the present study is work by Olney et al. 
[43] on detecting authentic questions from the aforementioned 
Partnership dataset as well as a newly collected CLASS 5 dataset 
with automatic speech recognition (ASR) transcriptions (see 
Section 2.1). Their main goal was to address heavily imbalanced 
classes, which occur because of the relatively infrequent proportion 
of authentic questions (about 3%) compared to all teacher 
utterances. The class imbalance problem was so severe that they 
forewent identification of individual authentic questions, instead 
focusing on predicting the proportion of all utterances in a class 
session that were authentic questions. In other words, an utterance- 
level binary prediction problem (i.e., labeling an utterance as an 
authentic question or not) was recast as the problem of predicting 
the proportion of authentic questions at the class level. 
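The recasting described above can be illustrated with a minimal sketch, assuming each utterance carries a class-session identifier and a binary authenticity code. The function name and the `(session_id, is_authentic)` pair layout are hypothetical and are not taken from the original datasets.

```python
from collections import defaultdict

def authentic_proportion_by_session(utterances):
    """Map each class session to the share of its utterances coded as authentic.

    `utterances` is an iterable of (session_id, is_authentic) pairs; this
    layout is an illustrative assumption, not the original data format.
    """
    totals = defaultdict(int)
    authentic = defaultdict(int)
    for session_id, is_authentic in utterances:
        totals[session_id] += 1
        authentic[session_id] += int(is_authentic)
    return {s: authentic[s] / totals[s] for s in totals}

# Toy data: session s1 has 1 authentic question in 4 utterances, s2 has none.
toy = [("s1", True), ("s1", False), ("s1", False), ("s1", False),
       ("s2", False), ("s2", False)]
proportions = authentic_proportion_by_session(toy)
```

The resulting per-session proportion is the regression target, so rare positive utterances no longer have to be identified one at a time.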


Using a combination of 242 pre-defined features, extracted at the 
word, sentence, and discourse level, they first attempted 
aggregating utterance-level predictions of authentic questions, 
obtained with SMOTEBoost [11], to the class level. This yielded 
correlations of 0.27 and 0.44 between the predicted and actual 
(human-coded) authenticity proportions on the Class 5 and 
Partnership datasets, respectively. The difference in correlations 
was attributed to the differences in the degree of class imbalance 
across the two datasets because the Partnership data only contained
instructional questions whereas the Class 5 data contained all
teacher utterances. Next, they aggregated their utterance-level
features to the class level (by taking their mean, sum, and standard
deviation to yield 726 features) and then trained an M5P regression
tree [23] on the resulting class-level features. The resulting
correlation increased from 0.27 to 0.50 for the Class 5 dataset (with 
the most severe imbalance) but remained similar (0.42 vs. 0.44) for 
the Partnership dataset (with minor imbalance). Further 
refinements by Kelly et al. [37], including adding 6 new class-level 
features, resulted in correlations of 0.61 and 0.42 on the Class 5 and 
Partnership datasets, respectively. 


We attempt to improve on these results using an open vocabulary 
approach for class-level authenticity prediction. In an open 
vocabulary approach, the features used to train a classifier are 
determined from the data itself and are not pre-determined. To 
illustrate, albeit in a different domain, Schwartz et al. [56] used an 
open vocabulary approach to predict gender, age, and personality 
traits based on social media posts. They computed counts of words 
and phrases (i.e., n-grams) per participant, and then filtered phrases 
based on pointwise mutual information (PMI) [13, 38], which 
ensured that they only kept phrases with high informational value. 
They then normalized the word and phrase counts by the total 
number of words for each participant and applied the Anscombe 
transformation [3] to the normalized values to stabilize their 
variances. They also generated topics using Latent Dirichlet 
Allocation (LDA) [7, 59]. Using words, phrases, and topics as 
features, the authors were able to predict gender, age, and 
personality traits more accurately than a closed vocabulary 
approach using features from Linguistic Inquiry and Word Count 
(LIWC) [48, 49]. We apply a variant of this basic approach in the 
present study. 
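The normalization and variance-stabilization steps attributed to Schwartz et al. [56] above can be sketched as follows. This is a minimal illustration for unigrams only, assuming the standard form of the Anscombe transformation, 2·sqrt(x + 3/8); the function names are hypothetical, and the sketch does not claim to reproduce the exact preprocessing of either study.

```python
import math
from collections import Counter

def anscombe(x):
    """Anscombe variance-stabilizing transform: 2 * sqrt(x + 3/8)."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

def transformed_relative_frequencies(tokens):
    """Normalize word counts by the total word count, then apply the transform."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: anscombe(count / total) for word, count in counts.items()}

# "what" occurs 2 times out of 5 tokens, so its relative frequency is 0.4.
features = transformed_relative_frequencies("what do you think what".split())
```

Dividing by the total word count keeps long and short documents comparable before the transform is applied.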


1.2 Novelty and Contributions 

We expand on and improve upon previous work [43] on 
automatically estimating the proportion of authenticity in 
classroom discourse using the same datasets. We call this previous 
approach a closed vocabulary approach since the features are 
predefined and are independent of the dataset. An advantage of the 
closed vocabulary approach is that it is less likely to overfit to the 
dataset at hand because it does not directly encode (as features) 
specific words from the corpus. This might be particularly 
important in the case of classroom discourse because generalizable 
models should encode language that correlates with authentic 
questions vs. being specific to the particular topic being discussed 
in class (e.g., The American Civil War). 


In contrast, an open vocabulary approach uses counts of words and 
phrases found in the corpus. The vocabulary is “open” in that the 
features change depending on the corpus. A potential disadvantage 
of this approach is that it is more likely to overfit to the training 
dataset. However, we think this problem can be alleviated by 
careful selection of words and phrases for use as features. The 
advantage of this approach is that it ostensibly allows for the 
detection of a wider variety of instructional constructs due to a lack 
of pre-determined features. It also yields more interpretable models 
in that one can examine the specific words, phrases, and utterances 
that signal authenticity compared to some of the pre-defined 
features used in the closed vocabulary approach. 


Previous research [56] has indicated that an open vocabulary 
approach outperforms the closed vocabulary approach on a 
different task of gender, age, and personality prediction from social 
media. How might it fare for the present task of authenticity 
prediction and what are the words and phrases that signal 
authenticity? Is there an advantage to combining both approaches?
These are the questions that motivated the present study. 


2. METHOD 


2.1 Datasets 

CLASS 5 (new) data. CLASS 5 data were collected between 
January 2014 and May 2016 from 132 classes taught by 14 different 
teachers at seven schools in rural Wisconsin. The data consisted of 
in-class observations in the form of live coding of authenticity by 
trained researchers and subsequent offline refinement of the coding 
from recorded audio. Both teacher and school identifiers were 
preserved with the data. 


Given the logistical constraints of using individual microphones for 
each student, the recording instrumentation instead focused on 
high-quality teacher audio suitable for ASR (see [15] for a 
description of the setup). Classroom audio, which included both 
teacher and student speech, was recorded from a stationary 
boundary microphone, and is not of sufficient quality to be used for 
ASR; it is useful for marking when students speak but is not 
analyzed further here. Thus this dataset differs from the archival 
data (see below) in that the audio is automatically segmented into 
utterances, which are converted into transcripts using Bing Speech 
ASR with accompanying errors. Further, only teacher speech is 
transcribed, and the transcripts contain all utterances rather than 
just questions. 


Partnership (archival) data. The archival data was collected in 
the Partnership for Literacy Study (Partnership), a study of 
professional development, instruction, and literacy outcomes in 
middle school English and language arts classrooms. The study 
collected data from 7th- and 8th-grade English and language arts
teachers in Wisconsin and New York State from 2001 to 2003. 
Over that two-year period, 119 classrooms in 21 schools were 
observed twice in the fall and twice in the spring. Three of the 
classrooms had missing question data and could not be used for this 
study, leaving us with 116 classrooms. Classroom observations for 
Partnership were conducted using a near-real-time computer-based 
annotation system [41]. The primary focus of the system was to 
annotate the dialogic properties of questions asked by both teachers 
and students. During this process, the instructional questions were 
transcribed by humans, and the transcriptions were mostly accurate, 
but not verbatim. Reliability studies indicate that raters agree on 
question properties approximately 80% of the time, with 
observation-level inter-rater correlations averaging approximately 
0.95 [42].


Table 1 shows a comparison of both datasets. Note that the same 
rubric was used to code authentic questions in both datasets. 


2.2 Natural Language Processing 

Closed vocabulary approach. The closed vocabulary approach 
used 732 specific features to predict the proportion of authentic 
questions in class sessions. This feature set includes specific words 
(like “Why” and “What”), part-of-speech tags, named entity type 
categorizations (such as PERSON, LOCATION, and DATE), 
syntactic dependencies (like subject, direct object, and indirect 
object), and discourse-level features (such as contrast and 
elaboration discourse relations, and joint, nucleus, and satellite 
elementary discourse units). There were 242 utterance-level 
features, which were aggregated at the class level by taking their 
mean, sum, and standard deviation [43]. Two more features were 
later added at the utterance level, leading to six more features at the 
class level, for a total of 732 class-level features [37]. 
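The class-level aggregation can be sketched as below: every utterance-level feature contributes a mean, a sum, and a standard deviation, so 242 utterance-level features become 3 × 242 = 726 class-level features, and 244 become 732. This is a minimal sketch with hypothetical names. Using the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics

def aggregate_to_class_level(utterance_features):
    """Collapse per-utterance feature vectors into mean, sum, and std per feature.

    `utterance_features` is a list of equal-length lists, one per utterance.
    Population standard deviation is an assumption; the source does not
    specify sample versus population form.
    """
    columns = list(zip(*utterance_features))
    means = [statistics.fmean(col) for col in columns]
    sums = [sum(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means + sums + stds

# Two utterances with two features each yield 3 * 2 = 6 class-level features.
session = [[1.0, 0.0], [3.0, 2.0]]
class_vector = aggregate_to_class_level(session)
```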


Open vocabulary approach. The open vocabulary approach used 
a variable number of features depending on the dataset. This 
method was adapted from the open vocabulary language model 
developed by Park et al. [46]. To start, counts of words, two-word 
phrases, and three-word phrases were computed from the corpus. 
See Table 1 for a comparison of n-gram counts prior to filtering 
(see below). 


We used a stop word list from Pedregosa et al. [47] to remove the
most common English words (such as “the” and “and”), along with
any phrases containing them. We also
required each word or phrase to occur in at least some percentage 
of documents, which we call the cutoff (we investigated multiple 
cutoffs, with results shown in Section 3). 
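A minimal sketch of the two filters just described, assuming whitespace-tokenized class sessions. The tiny stop word set is a stand-in for the list from [47], and the function names and the `cutoff` fraction parameter are hypothetical.

```python
STOP_WORDS = {"the", "and", "a", "of", "to", "we", "do", "you"}  # illustrative stand-in

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_vocabulary(documents, cutoff, max_n=3):
    """Keep stop-word-free n-grams that occur in at least `cutoff` of documents."""
    doc_freq = {}
    for tokens in documents:
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(g for g in ngrams(tokens, n)
                        if not any(w in STOP_WORDS for w in g))
        for g in seen:  # count each n-gram once per document
            doc_freq[g] = doc_freq.get(g, 0) + 1
    min_docs = cutoff * len(documents)
    return {g for g, df in doc_freq.items() if df >= min_docs}

docs = [
    "what do you think about the story".split(),
    "why do you think that".split(),
    "open the book and read".split(),
]
# With a 60% cutoff, only "think" (present in 2 of 3 documents) survives.
vocab = select_vocabulary(docs, cutoff=0.6)
```

Raising the cutoff shrinks the vocabulary, which is the mechanism behind the feature counts reported for each cutoff in Section 3.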


We then calculated the pointwise mutual information (PMI) of each 
phrase, defined as: 


$$\mathrm{pmi}(\text{phrase}) = \log \frac{p(\text{phrase})}{\prod_{\text{word} \in \text{phrase}} p(\text{word})}$$


where p(phrase) is the probability of a phrase based on its relative
frequency in the training data and ∏ p(word) is the product of the
probabilities of each word in the phrase in the training data. We
filtered out phrases where the PMI was less than three times the 
number of words in the phrase [13, 38]. This helped ensure that we 
only used meaningful phrases (such as “language arts”), rather than 
phrases that were just the result of frequent words occurring next to 
one another (such as “next we will”). We experimented with PMI 
thresholds ranging from zero to four times the number of words in 
the phrase, but no difference in performance was observed. Cutoff 
and PMI filtering were based only on data in the training folds,
ensuring that the test folds were not affected (see Section 2.3).
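The PMI filter can be sketched as follows. Several details are assumptions because the text does not specify them: the logarithm base (base 2 here), estimating p(phrase) as the phrase's share of all n-grams of the same length, and the function names themselves. The keep rule follows the text: a phrase is retained when its PMI is at least three times its number of words.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pmi(phrase, tokens, log_base=2.0):
    """log( p(phrase) / product of p(word) ), from relative frequencies.

    Only call this for phrases that occur in `tokens`; log base 2 is an
    assumption, not something the source states.
    """
    word_counts = Counter(tokens)
    phrase_counts = ngram_counts(tokens, len(phrase))
    p_phrase = phrase_counts[phrase] / sum(phrase_counts.values())
    p_words = math.prod(word_counts[w] / len(tokens) for w in phrase)
    return math.log(p_phrase / p_words, log_base)

def keep_phrase(phrase, tokens, multiplier=3.0):
    """Keep a phrase when its PMI is at least `multiplier` times its length."""
    return pmi(phrase, tokens) >= multiplier * len(phrase)

# Toy corpus: a rare but tightly bound phrase plus a very frequent word pair.
tokens = ["language", "arts"] + ["next", "we"] * 49
```

In this toy corpus, `keep_phrase(("language", "arts"), tokens)` holds because the two words occur only together, whereas `("next", "we")` is rejected because its words are individually frequent.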


Combined approach. We simply averaged predictions from the 
closed and open vocabulary approaches. 
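As a minimal sketch, the combination amounts to an unweighted mean of the two models' class-session predictions. The function name and the toy values are illustrative only.

```python
def combine_predictions(closed_preds, open_preds):
    """Average two equal-length lists of class-session predictions."""
    if len(closed_preds) != len(open_preds):
        raise ValueError("prediction lists must cover the same class sessions")
    return [(c + o) / 2.0 for c, o in zip(closed_preds, open_preds)]

# Toy predicted authenticity proportions for two class sessions.
combined = combine_predictions([0.25, 0.5], [0.75, 0.0])
```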


Table 1. Summary of the two datasets


Item                          Class 5    Partnership
# Utterances                   45,044        Unknown
# Instructional Questions       4,377         25,711
# Authentic Questions           1,510         12,862
% Authentic Utterances             3%        Unknown
% Authentic Questions             34%            50%
Unigrams                       17,520          8,358
Bigrams                       152,023         61,460
Trigrams                      319,545        117,049


Note. % Authentic Utterances refers to teacher utterances aligned
with authentic questions. % Authentic Questions refers to
instructional questions that were also authentic. N-gram counts are
prior to filtering.


2.3 Model Training 


We used M5P model trees, which are decision trees that have 
regression functions at each leaf node [23]. Starting at the root of 
the tree, decisions to follow a left or right branch are based on the 
value of a particular feature until a leaf with the appropriate 
regression model is reached. We chose the M5P model to enable 
comparisons with previous work [43]. 
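How prediction with a model tree proceeds can be illustrated with a small hand-built tree. This is only a sketch of the traversal described above: it omits tree induction and any pruning or smoothing that an actual M5P implementation may perform, and the `Node` class, thresholds, and coefficients are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """Internal nodes set `feature`; leaves leave it as None and hold a linear model."""
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    intercept: float = 0.0
    coefs: Sequence[float] = ()

def predict(node, x):
    """Follow branches by feature value, then evaluate the leaf's regression."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.intercept + sum(c * v for c, v in zip(node.coefs, x))

# Hypothetical tree: split on feature 0 at 0.5, with one linear model per leaf.
tree = Node(
    feature=0, threshold=0.5,
    left=Node(intercept=0.25, coefs=(0.0, 0.5)),
    right=Node(intercept=1.0, coefs=(2.0, 0.0)),
)
```

For example, `predict(tree, [0.0, 1.0])` follows the left branch and evaluates 0.25 + 0.5 × 1.0.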


All models used cross-validation, with selection of words and 
phrases to use as features for the open vocabulary approach based 
only on the training folds; we did not peek into the testing folds. 
For generalizability to new teachers, it was important that a teacher 
would not appear in both the training and testing folds. For the
CLASS 5 data, this was achieved using leave-one-teacher-out 
cross-validation. For the archival Partnership data, the mapping 
between teachers and data files was incomplete, and so the mapping 
between schools and data files was used instead. This leave-one- 
school-out cross-validation assumes that a teacher did not transfer 
between schools during the study (a likely assumption), and in a 
sense is even more conservative than leave-one-teacher-out 
because it controls for similarities shared by teachers at the same 
school. 
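The fold construction can be sketched with a small generator that holds out one group at a time. The same function covers both cases: the group label is the teacher for CLASS 5 and the school for Partnership. The function name and the toy labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# One label per class session; here the labels are teacher identifiers.
teachers = ["t1", "t1", "t2", "t3", "t3"]
folds = list(leave_one_group_out(teachers))
```

Because every class session from the held-out group lands in the test fold, no teacher (or school) contributes to both training and testing.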


It should be noted that the unit of analysis is always a class-session. 
That is, counts for the language model, feature aggregation, and 
authenticity aggregation are all done at the level of an individual 
class-session. 


2.4 Method Pseudocode 


Below is pseudocode outlining our method for teacher-level cross-validation. 


Aggregate utterance-level transcripts to the class session level
For each cutoff percentage:
    For each teacher:
        Split data into training set (class sessions from other teachers) and test set (class sessions from this teacher)
        Get counts of n-grams (words, bigrams, and trigrams) for each class session in training set
        Remove n-grams that contain words from stop word list
        Remove n-grams that do not appear at least once in the cutoff percentage of class sessions
        Filter phrases (bigrams and trigrams) using pointwise mutual information
        Get counts of kept n-grams for each class session in test set
        Train M5P model on n-gram counts from training set class sessions
        Use M5P model to predict authenticity on test set class sessions
    Pool class session authenticity predictions across teachers
    Compute correlation between predicted and actual authenticities for cutoff percentage
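The loop structure of the pseudocode can be sketched in runnable form under heavy simplifications, all of which are assumptions: only unigrams are counted, the stop word and PMI filters are omitted, and a nearest-neighbor rule stands in for the M5P model tree. The point of the sketch is that the vocabulary is selected from the training fold only and that predictions are pooled across held-out teachers. All names and toy data are hypothetical.

```python
from collections import Counter

def unigram_vocab(train_docs, cutoff):
    """Words occurring in at least `cutoff` (a fraction) of training sessions."""
    doc_freq = Counter(w for doc in train_docs for w in set(doc))
    return sorted(w for w, df in doc_freq.items() if df >= cutoff * len(train_docs))

def vectorize(doc, vocab):
    """Count each vocabulary word in one class session."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in regressor (NOT M5P): copy the target of the closest training row."""
    dists = [sum(abs(a - b) for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def cross_validate(docs, targets, teachers, cutoff):
    """Return pooled out-of-fold predictions, one per class session."""
    preds = [None] * len(docs)
    for held_out in sorted(set(teachers)):
        train = [i for i, t in enumerate(teachers) if t != held_out]
        test = [i for i, t in enumerate(teachers) if t == held_out]
        vocab = unigram_vocab([docs[i] for i in train], cutoff)  # training fold only
        train_X = [vectorize(docs[i], vocab) for i in train]
        train_y = [targets[i] for i in train]
        for i in test:
            preds[i] = nearest_neighbor_predict(train_X, train_y, vectorize(docs[i], vocab))
    return preds

docs = [d.split() for d in [
    "think maybe think", "need work need", "think maybe",
    "need work", "maybe think think think", "work work need",
]]
targets = [0.4, 0.0, 0.3, 0.1, 0.5, 0.0]  # toy authenticity proportions
teachers = ["t1", "t1", "t2", "t2", "t3", "t3"]
preds = cross_validate(docs, targets, teachers, cutoff=0.5)
```

Repeating `cross_validate` for each cutoff and correlating `preds` with `targets` corresponds to the two outermost steps of the pseudocode.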


3. RESULTS 


Our outcome measure is the Pearson correlation between the 
computer- and human-coded estimates of proportion authenticity 
per class session. We recomputed the previous results [37] obtained 
with the closed vocabulary approach and replicated the previous 
findings. 
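For reference, the outcome measure can be computed as in this minimal sketch of the Pearson correlation. The function name and the toy proportions are illustrative and are not values from either dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy human-coded versus predicted authenticity proportions per class session.
human = [0.00, 0.05, 0.10, 0.20]
predicted = [0.02, 0.04, 0.08, 0.12]
r = pearson_r(human, predicted)
```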


3.1 Cutoff Percentage (Open Vocabulary Approach)


As mentioned in Section 2.2, we tested various cutoff percentages 
for the open vocabulary approach. As can be seen in Figure 1, for the
Class 5 dataset the correlation starts out low as the model is
overwhelmed by the sheer number of features (Figure 2). However, as
the cutoff becomes more stringent and the number of features
decreases, the results improve until the correlation peaks at 0.602,
achieved with 52 features at an 82% cutoff. Beyond this point, the
correlation steeply drops as too few features remain.


We observed a different pattern for the Partnership data as noted in 
Figure 3 and Figure 4. Here, the results were less dependent on the 
number of features, though the best correlation of 0.396 was 
obtained at the 61% cutoff with only 6 features retained. It should 
be noted that we only considered up to a 70% cutoff for this dataset 
because there were only three remaining features beyond this point. 
This is unsurprising because the Partnership data, though more 
diverse, only contains questions compared to the full transcripts in 
the CLASS 5 dataset, and consequently contains far fewer unique 
n-grams (see Section 2.2). 

Figure 1. Correlation by cutoff % for Class 5 dataset

Figure 2. # of features by cutoff % for Class 5 dataset

Figure 3. Correlation by cutoff % for the Partnership dataset

Figure 4. # of features by cutoff % for the Partnership dataset


3.2 Comparison with Closed Vocabulary Results


For the Class 5 data, the best correlation of 0.602 obtained via the 
open vocabulary approach was significant (p < .001) and similar to 
the significant 0.613 (p < .001) correlation obtained from the closed 
vocabulary approach. Zou’s [66] test of the difference between two 
overlapping dependent correlations with one common variable (i.e., 
the gold-standard authenticity codes) indicated that the two 
correlation coefficients were statistically equivalent at the p < .05 
level. A similar pattern of results was obtained for the Partnership 
data in that the significant 0.396 (p < .001) correlation from the
open vocabulary approach was statistically equivalent to the
significant 0.421 (p < .001) correlation from the closed vocabulary
approach at the p < .05 level. Subsequent results focus on these two
“best” models. 
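A test of this kind can be sketched as a confidence interval on the difference between two correlations that share one variable. The sketch assumes the confidence-interval formulation commonly attributed to Zou [66], built from Fisher z intervals for each correlation; the formula should be checked against that reference before use, and the function names are hypothetical. The inputs are the values reported for the Class 5 data: 0.602 and 0.613 for the two approaches, the .559 correlation between their predictions reported in Section 3.3, and n = 132 class sessions.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Confidence interval for one correlation via the Fisher z transform."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def zou_overlapping_ci(r12, r13, r23, n, z_crit=1.959964):
    """CI for r12 - r13 when both correlations share variable 1 (assumed form)."""
    l1, u1 = fisher_ci(r12, n, z_crit)
    l2, u2 = fisher_ci(r13, n, z_crit)
    # Correlation between the two correlation estimates.
    c = ((r23 - 0.5 * r12 * r13) * (1 - r12**2 - r13**2 - r23**2) + r23**3) / (
        (1 - r12**2) * (1 - r13**2))
    diff = r12 - r13
    lower = diff - math.sqrt((r12 - l1)**2 + (u2 - r13)**2
                             - 2 * c * (r12 - l1) * (u2 - r13))
    upper = diff + math.sqrt((u1 - r12)**2 + (r13 - l2)**2
                             - 2 * c * (u1 - r12) * (r13 - l2))
    return lower, upper

# Open (0.602) versus closed (0.613) vocabulary on Class 5, n = 132 sessions.
lower, upper = zou_overlapping_ci(0.602, 0.613, 0.559, 132)
```

Under these assumptions the interval contains zero, which agrees with the conclusion that the two correlations do not differ significantly.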


3.3 Combined Models 


The analyses thus far indicate that the closed and open vocabulary
approaches were equally predictive of authenticity across both
datasets. Authenticity estimates from both methods correlated at
.559 (p < .001) and .371 (p < .001) for the Class 5 and Partnership 
datasets, respectively, suggesting some, but not substantial, 
redundancy. This raises the question of whether a combination of 
the two approaches might improve predictive power. 


We addressed this question by averaging the predictions of the two 
best models (we also attempted feature-level fusion, but this 
resulted in lower performance; results not shown here). For Class 
5, the combined model predicted authenticity with a significant 
correlation of .686 (p < .001), which was quantitatively and
statistically higher (p < .05) than the 0.602 and 0.613 correlations 
obtained from the open and closed vocabulary approaches, 
respectively (see Figure 5). 


[Bar chart: correlation (y-axis) by dataset (CLASS 5, Partnership) for the Closed Vocab, Open Vocab, and Combined models]


Figure 5. Comparison of closed, open, and combined models 


These results can be visualized as a density plot (see left of Figure 
6). The plot illustrates smoothed histograms of class-level 
computer- and human-provided proportional authenticity 
estimates. We note the combined model tends to slightly 
overestimate the mean compared to the human-coded data. Its 
predictions are also less positively skewed, ostensibly because it 
underpredicts some cases with considerable human-coded 
authenticity (also see right of Figure 6). 


A similar pattern of results was obtained for the Partnership data. 
Specifically, the combined model’s correlation of .492 was 
significant (p < .001) and also significantly higher (p < .05) than 
the 0.396 and 0.421 correlations obtained from the open and closed 
vocabulary approaches, respectively (see Figure 5). As noted in the 
density plot in Figure 7, the combined model is “peakier” with a 
reduced range in either direction compared to the human-coded 
data. The model has difficulty with cases associated with very low 
and very high human-coded authenticity (see scatterplot in Figure 
7). 

Figure 6. Density plot and scatter plot showing the resulting predictions from combining both the open and closed vocabulary
models on the Class 5 dataset compared to human codes.

Figure 7. Density plot and scatter plot showing the resulting predictions from combining both the open and closed vocabulary
models on the Partnership dataset compared to human codes.


3.4 Feature Analysis

We investigated the features (words and phrases) from the best 
open vocabulary model in the form of word clouds¹ scaled using
correlations of individual features with authenticity rather than by 
absolute frequency in the corpus. Figure 8 shows words that 
positively correlate with authenticity for the Class 5 dataset. The 
words “Question,” “Maybe,” and “Ok” correlated most strongly 
with authenticity (correlation values of .254, .229, and .219 
respectively). These words are used to ask questions, indicate 
uncertainty, or to accept another’s response. This might suggest the 
teacher is setting the stage for open dialogue, which is precisely 
what authentic questioning signals. 


[Word cloud; legible words include “Maybe,” “Question,” and “Good”]

Figure 8. Words that are positively correlated with 
authenticity in the Class 5 dataset. 


Alternatively, the words “Need,” “Work,” and “Doing” were most 
negatively correlated with authenticity (correlation values of -.383, 
-.330, and -.302 respectively) — see Figure 9 for the full word cloud. 
These words might be more likely to occur during non-dialogic 
activities, such as lecture or individual work. 

Figure 9. Words and phrases that are negatively correlated
with authenticity for the Class 5 dataset. 


For the Partnership dataset, only “Like,” “Think,” and “Say” were 
positively correlated with authenticity (correlation values of .177, 
.158, and .055 respectively). It is plausible that these terms 
accompany more open-ended authentic questions (e.g., “Why do
you like the last story?” or “What do you think about that?” or “Why
did you say that?”) compared to their non-authentic counterparts
that solicit specific responses (e.g., “What do we know about the
beginning?”); these are all hypothetical examples.


There were also only three words that negatively correlated with 
authenticity. “Does” was more strongly correlated than “Know” 
and “Did” (correlation values of -.246, -.062, and -.032 
respectively). “Does” might be more likely to accompany
information-seeking questions, such as “What does mandible
mean?” or “How does Jim know he is in danger?” compared to
more authentic questions. Of course, these are only speculative
suggestions that need to be verified by more systematic analyses.


¹ Word clouds were generated via https://worditout.com


4. DISCUSSION 


We addressed the task of automated prediction of the proportion of 
authentic questions in a class session from real-world classroom 
discourse. We compared a previous closed vocabulary approach to 
an open vocabulary approach, combined the two, and tested them 
on two datasets. In the remainder of this section, we discuss our 
main findings, possible applications of this work, as well as 
limitations and directions for future work. 


4.1 Main Findings 


We found that the open and closed vocabulary approaches yielded
equivalent performance on both datasets, but a simple combination
of the two resulted in statistically better results. This suggests that
knowledge of the domain, as reflected in some of the closed
vocabulary features (the question-specific ones), is very important,
but missed patterns can be captured using the open vocabulary 
approach. Thus, the combined approach capitalized on the strengths 
while mitigating the weaknesses of each individual approach. 


The fact that the result replicated across two rather different 
datasets increases our confidence in the findings. This is 
particularly important because the datasets differ in a number of 
substantial ways — for example, one contained ASR transcripts of 
entire class sessions while the other contained human transcriptions 
of question text; one was much more variable, larger in size, and 
was validated at the school-level compared to the smaller, more 
homogeneous dataset that was validated at the teacher level.


The open vocabulary approach provided key insights into the 
specific words used to guide its predictions. Of particular interest 
was the fact that the word “think” was positively correlated with 
authenticity in both datasets, but the word “like” was negatively 
correlated with authenticity in one and positively in another. This 
suggests the importance of examining the broader context in which 
these words appear. 


4.2 Applications 

Like anyone, teachers need feedback to improve. But in contrast to 
an expert musician or athlete who receives continual feedback 
across the countless hours spent in practice for the occasional 
performance, a teacher delivers approximately 1,000
“performances” a year with almost no feedback [22, 60]. Given the 
pivotal role of feedback to learning [5, 14, 21, 57], the lack of 
immediate and objective feedback is a critical barrier that needs to 
be overcome if we are truly going to innovate teaching.


Accordingly, one key application of our work is in an automated 
teacher feedback system with the goal of improving teaching 
effectiveness and consequently student learning. Such a system 
needs to be able to detect different measures of teaching 
effectiveness beyond authentic questions (e.g., goal clarity, 
disciplinary concepts, strategy use, elaborated feedback), and the 
open vocabulary approach is particularly suited for this task. 


Ultimately, we envision technology that will autonomously analyze 
teachers’ behaviors as they go about their daily activities, both 
within and beyond the classroom. The technology would provide 
formative feedback (i.e., feedback aimed at improvement rather 
than evaluation [57]), which the teacher can use as a form of DIY
(do it yourself) professional development or share with support 
staff. The feedback can enable reflective practice, defined as 
thoughtfully considering one's own actions and experiences to 
refine one’s skill in a selected discipline [55]. Due to its emphasis 
on contextualized analysis and metacognition, reflective practice 
holds great promise in improving teaching effectiveness [9, 10], 
which should result in positive downstream influences on student 
achievement given the robust relationship between the two [12, 17, 
29, 34, 51, 52, 65]. 


Such a technology can also be used to streamline research into 
teaching effectiveness, which currently relies on cumbersome 
human observation (see the introduction). Going beyond question 
authenticity, at a broader level, such a technology could be used to 
advance basic research on student-teacher discourse, essentially 
opening up the methods of “big data” science to real-world 
classrooms. 


4.3 Limitations & Future Work 


One limitation of this study is the amount and variety of classroom 
transcriptions with corresponding authenticity labels. The Class 5 
dataset was collected in a very limited geographical location. The 
Partnership dataset, although much more variable in terms of the 
sample, only included transcriptions of questions rather than 
transcriptions of all teacher utterances. 


Our models also detect authenticity at the level of an entire class 
session, rather than at the individual utterance level. A finer grain size
is needed to provide actionable feedback to teachers, at least with 
respect to the vision articulated above. We also did not correlate 
our results with more objective measures, particularly achievement 
growth, due to a lack of available data. 


In addition to addressing the aforementioned limitations, future 
work should include using the open vocabulary approach to predict 
measures beyond authenticity. We are taking a step in this direction 
by re-coding current CLASS 5 audio as well as collecting new 
audio files and coding them for the following broader dimensions 
of discourse linked, or hypothesized to be linked, to student 
achievement growth: goal clarity, disciplinary concepts, and 
strategy use for teacher-led discourse, and challenge, connection, 
and elaborated feedback for transactional discourse. 


We are also streamlining the data collection process, essentially 
providing usable tools for teachers to collect their own data, and 
have collected over 65 hours of audio (in about two months) using 
this approach. When coupled with existing data from CLASS 5, we 
estimate that the combined datasets will be sufficiently large to 
experiment with deep natural language processing methods, such 
as long short-term memory recurrent neural networks [31] and hierarchical
attention networks [64]. 


4.4 Concluding Remarks 

We applied an open vocabulary approach to the task of predicting 
authentic questions in classroom discourse and compared it to a 
previous closed vocabulary approach applied to the same problem. 
We found that the two approaches yielded equivalent performance, 
but a combination led to higher correlations than either method
alone. We achieved a correlation of close to 0.70 on real-world 
audio, which suggests that fully-automated methods might 
complement or even replace humans on the difficult task of 
determining the level of dialogism in classroom discourse. 